Chromosome-level genome assembly of the Siberian chipmunk (Tamias sibiricus)

Tamias sibiricus is regarded as a predominant scatter-hoarder that stores its food items both in small scattered caches and in underground larder-hoards. This behavior, which provides essential seed dispersal services for many plant species worldwide, relies heavily on accurate spatial memory and an acute sense of olfaction. Here, we assembled a chromosome-scale genome of T. sibiricus using Illumina sequencing, PacBio sequencing and the chromosome conformation capture (Hi-C) technique. The genome was 2.64 Gb in size with a scaffold N50 length of 172.61 Mb. A total of 2.59 Gb of sequence was anchored and oriented onto 19 chromosomes (ranging from 28.70 to 222.90 Mb), giving an anchoring rate of 98.03%. In addition, 25,311 protein-coding genes were predicted with an average gene length of 32,936 bp, and 94.73% of these genes were functionally annotated. This reference genome will be a valuable resource for in-depth studies of the basic biological processes and environmental adaptation of the Siberian chipmunk, and will promote comparative genomic analyses with other species within Rodentia.


Background & Summary
The Siberian chipmunk, Tamias sibiricus (Laxmann, 1769), belongs to the subfamily Xerinae, within the family Sciuridae of the order Rodentia 1. This species is a small, diurnal, ground-dwelling squirrel that lives in mountain and forest habitats with a bushy understory 2. Wild populations of T. sibiricus are naturally distributed in Russia and several East Asian countries (China, Mongolia, Korea and Japan). This squirrel is also one of the most popular companion animals because of its attractive appearance and unique behavior 3. It has therefore been introduced into European countries as a pet for decades, and accidentally escaped individuals have successfully established populations in the wild 4. Additionally, as an important seed dispersal agent that relies primarily on scatter- and larder-hoarding, T. sibiricus provides essential seed dispersal services in many ecosystems across the world 5. Over the past decades, studies of T. sibiricus have mainly focused on biology, behavior, ecology, and phylogeography [5][6][7][8]. However, little is known about the genetic basis and mechanisms of its environmental adaptation because of limited molecular information.
In the present study, we constructed a high-quality genome assembly for the Siberian chipmunk by integrating short reads (Illumina sequencing), long reads (PacBio sequencing) and Hi-C reads (proximity ligation chromatin conformation capture). The final assembled genome of T. sibiricus was 2.64 Gb in size with a scaffold N50 length of 172.61 Mb. A total of 2.59 Gb of assembled sequence was successfully anchored onto 19 chromosomes. This number of chromosomes is consistent with the results of karyotype analysis 9. Repetitive sequences totaling 1.03 Gb were identified, constituting 38.87% of this reference genome. A total of 25,311 protein-coding genes were predicted, and 94.73% of these genes were functionally annotated.

Methods
Sample collection and ethics statement. An adult female specimen of T. sibiricus was originally collected from a forestry farm in Chifeng, Inner Mongolia Autonomous Region of China (41°39′N, 118°22′E) in October 2020. The sample was then maintained at Qufu Normal University, and stored at −80°C prior to DNA and RNA extraction. All experiments were performed according to the Guidelines for the Care and Use of
Sequencing. Muscle tissue of the female specimen was prepared for transcriptome, Illumina, PacBio whole-genome and Hi-C sequencing. All sequencing was performed by the Shanghai Origingene Bio-pharm Technology Co. Ltd. (Shanghai, China). Genomic DNA was extracted using a Blood & Cell Culture DNA Mini Kit (Qiagen, Germany). The quality and quantity of the total DNA were determined with a 2100 Bioanalyzer (Agilent, USA) and a Qubit 3.0 Fluorometer (Invitrogen, USA), respectively. Total RNA was isolated using a TRIzol Total RNA Isolation Kit (Takara, USA) following the manufacturer's protocols 10. A NanoDrop 2000 spectrophotometer (Labtech, USA) and the 2100 Bioanalyzer were used to check RNA quality.
Whole-genome shotgun sequencing was performed with a single molecule real-time (SMRT) PacBio system. PacBio Sequel II libraries with an insert size of 30 kb were prepared using a SMRTbell Template Prep Kit 2.0. For the genome survey and for correcting the errors associated with long reads, two short paired-end libraries with an insert size of 350 bp were constructed using a TruSeq DNA PCR-free Kit (Illumina, USA). The short-read data were generated on the Illumina HiSeq X10 platform. To construct pseudo-chromosomes, the Hi-C library was constructed according to the standard protocols described previously 11. After quality control, 150 bp paired-end reads (PE150) were obtained using the Illumina HiSeq X10 platform. The cDNA library was constructed using a TruSeq RNA Sample Prep Kit v2 (Illumina, USA) and sequenced on the Illumina HiSeq X10 system using the paired-end strategy.
Genome survey and assembly. A total of 132.39 Gb of Illumina short-insert data was first generated to obtain a preliminary understanding of the genome characteristics (Table 1). Based on the clean data with duplicates removed, the K-mer frequency distribution was calculated with Jellyfish v2.2.6 12 and the results were subsequently analyzed with GenomeScope v2.0 13. The genome size of T. sibiricus was estimated to be 2.51 Gb with the number of unique K-mers peaked at 21 (Fig. 1). Evaluation of genome characteristics showed that the heterozygosity rate of the genome was 0.21% (Table S1).
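The idea behind a K-mer based genome size estimate can be illustrated with a minimal Python sketch. It is not the GenomeScope model, which fits a mixture distribution to the K-mer spectrum; it uses the simpler estimator of total K-mer instances divided by the depth of the main peak. The function names, the error cutoff `min_depth` and the toy histogram are assumptions made for illustration only.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough haploid genome size estimate from a K-mer depth histogram.

    histogram: dict mapping K-mer depth -> number of distinct K-mers
    min_depth: depths below this are treated as sequencing errors
               (an assumed cutoff, not a value from the paper)
    """
    kept = {d: n for d, n in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no K-mers at or above min_depth")
    # Take the depth with the most distinct K-mers as the main peak.
    peak_depth = max(kept, key=kept.get)
    # Total K-mer instances divided by peak depth approximates genome size.
    total_kmers = sum(d * n for d, n in kept.items())
    return total_kmers / peak_depth, peak_depth


def read_histogram(path):
    """Read a two-column 'depth count' text file, the layout written by `jellyfish histo`."""
    histogram = {}
    with open(path) as handle:
        for line in handle:
            depth, count = line.split()
            histogram[int(depth)] = int(count)
    return histogram


# Toy histogram (made-up numbers): low-depth error K-mers plus one main peak.
toy_histogram = {1: 5000, 2: 800, 19: 90, 20: 100, 21: 95}
size_estimate, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size_estimate:.2f} K-mers")
```

With a real histogram, `read_histogram` would supply the input in place of `toy_histogram`. In a heterozygous genome the tallest peak may be the heterozygous one, which is one reason model-based tools are preferred for the final estimate.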
For PacBio sequencing, approximately 111.63 Gb of long reads were obtained after removing adaptors from polymerase reads with default parameters. The mean length and N50 length of PacBio subreads were 35.62 and 24.13 kb, respectively (Table 1). After self-correction and long-read polishing, the initial genome assembly was performed using Canu v1.8 14 (Table 2). To further improve the quality and accuracy of the genome assembly, we polished it with high-coverage Illumina short reads using Pilon v1.23 15. The total size of the draft genome assembly was 2.64 Gb with an N50 length of 9.43 Mb.
For the chromosome-level assembly, 217.38 Gb of Hi-C sequencing data was generated and used to anchor contigs into pseudo-chromosomes (Table 1). The 3D-DNA v180922 pipeline was used to generate a chromosome-level assembly of the genome 16. After removing duplicates, the Hi-C contact map was taken directly as input for 3D-DNA, the location and direction of each contig were determined, and neighboring contigs were connected using gaps of 100 Ns. Juicebox v1.11.08 (Juicebox Assembly Tools, JBAT) was subsequently used to review and manually curate scaffolding errors 17. The final size of this genome was 2.64 Gb with a scaffold N50 of 172.61 Mb (Table 2). The size of the assembled Siberian chipmunk genome was thus close to that estimated from the genome survey analysis. In addition, 2.59 Gb of sequence was anchored and oriented onto 19 chromosomes with an anchoring rate of 98.03%, and the chromosome lengths ranged from 28.70 to 222.90 Mb (Table 3 and Fig. 2). After scaffolds were clustered, ordered and oriented to restore their relative locations, the heatmap of chromosome interactions indicated that the genome assembly was complete and robust (Fig. 1B).
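For readers unfamiliar with the two summary statistics used above, the following Python sketch shows how an N50 and an anchoring rate are computed. The function names and the toy lengths are illustrative assumptions. Because the anchoring rate here is calculated from the rounded figures in the text (2.59 Gb and 2.64 Gb), it comes out slightly different from the reported 98.03%, which was presumably derived from unrounded base counts.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("empty length list")


def anchoring_rate(anchored_bp, total_bp):
    """Percentage of the assembly placed on chromosomes."""
    return 100.0 * anchored_bp / total_bp


# Toy scaffold lengths (made-up): total 300, so half is 150.
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = n50(toy_lengths)

# Rounded values from the text, in Gb.
rate_from_rounded = anchoring_rate(2.59, 2.64)
print(f"toy N50 = {toy_n50}, anchoring rate = {rate_from_rounded:.2f}%")
```

Sorting from longest to shortest and accumulating until half of the total is reached is the conventional definition; applied to the scaffold lengths of an assembly FASTA it gives the scaffold N50.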
Chromosome synteny. Collinearity analysis of chromosomes between T. sibiricus and two other Sciuridae species (Sciurus vulgaris and Sciurus carolinensis) was conducted with LASTZ v1.02.00 18. As shown in Fig. 3, all 19 pseudochromosomes of T. sibiricus displayed high homology with the corresponding chromosomes of the two other squirrels, and two chromosomes (chr11 and chr15 of S. vulgaris, chr11 and chr14 of S. carolinensis) corresponded to a single fused chromosome (chr11) in the Siberian chipmunk. Previous studies using cross-species chromosome painting showed that the diploid chromosome number varies among species in the superorder Glires (Rodentia and Lagomorpha) 19,20, with the Siberian chipmunk having 38 chromosomes 9. Interestingly, the variation seems to follow a certain pattern, with diploid numbers such as 32, 34, 36, 38 and 40. Combined with our synteny results, this suggests that chromosome fusions and fissions may have occurred frequently during genome evolution in Glires. Further studies with more chromosome-level genomic data are therefore needed to determine the molecular mechanisms of chromosomal rearrangement and evolution.
Protein-coding gene annotation. The MAKER v3.01.03 pipeline was used to predict protein-coding genes by integrating three strategies: ab initio prediction, transcriptome-based annotation and homology-based annotation 27. The ab initio prediction was generated using the BRAKER v2.1.5 pipeline 28, which automatically trained the predictors Augustus v3.3.4 29 and GeneMark-ET 30, and made use of the mapped transcriptome data and protein homology information. The transcriptome information in BAM alignments was produced by HISAT2 v2.2.0 31, and the protein sequences were extracted from the OrthoDB10 v1 database 32. For transcriptome-based annotation, the RNA-seq data were first mapped to our assembly with HISAT2 to produce BAM alignments. Using our assembly as the reference genome, the RNA-seq data were further assembled into transcripts using StringTie v2.1.4 33
Table 4. Repeat annotation in the T. sibiricus genome.
norvegicus) were downloaded from the NCBI RefSeq database, and all sequences were used as the reference required by MAKER for the homology-based prediction. Overall, 25,311 protein-coding genes were predicted with an average gene length of 32,936 bp. The average exon number per gene was 7.52, with an average exon length of 171.85 bp and an average intron length of 4,850.84 bp. The final gene models were then annotated using the non-redundant (NR) protein database of NCBI, Swissprot, Pfam, the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) databases. In total, 23,995 (94.73%) genes were successfully annotated with at least one homologous hit by searching against these five public databases.
Based on BUSCO analysis, 94.4% of the genes in the BUSCO mammalia_odb10 database were identified (complete single-copy genes: 92.2%, fragmented genes: 1.5%), further supporting the accuracy and completeness of the gene prediction.
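As a quick plausibility check on the gene-structure statistics reported in this section, the average gene length can be approximated from the average exon count, exon length and intron length. The Python sketch below assumes a gene with n exons has n - 1 introns and that untranslated regions are either included in the exon figures or negligible. It also multiplies averages together, which only approximates the true mean of per-gene lengths, so a small gap from the reported value is expected. The function name is illustrative.

```python
def approx_gene_length(exons_per_gene, mean_exon_bp, mean_intron_bp):
    """Approximate mean gene length as exons plus the introns between them."""
    introns_per_gene = exons_per_gene - 1  # n exons are separated by n - 1 introns
    return exons_per_gene * mean_exon_bp + introns_per_gene * mean_intron_bp


# Values reported in the text.
estimate = approx_gene_length(7.52, 171.85, 4850.84)
reported = 32936
relative_gap = abs(estimate - reported) / reported
print(f"estimate = {estimate:.2f} bp, relative gap = {relative_gap:.4%}")
```

The estimate is about 32,920 bp, within 0.1% of the reported 32,936 bp, so the three averages are mutually consistent.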

Data Records
The genomic Illumina sequencing data was deposited in the NCBI Sequence Read Archive (SRA) database under accession No. SRR19929230 35. The genomic PacBio sequencing data was deposited in the SRA database under accession No. SRR19961223 36. The transcriptome Illumina sequencing data was deposited in the SRA database under accession No. SRR19961278 37. The Hi-C sequencing data was deposited in the SRA database under accession No. SRR19960530 38. The assembled genome was deposited in GenBank at NCBI under accession No. GCA_025594165.1 39. Genome annotation information on repeated sequences, gene structure and functional prediction is available in the Figshare database 40.

Technical Validation
The completeness and accuracy of the assembled genome were evaluated using two different strategies. First, BUSCO analysis (BUSCO v4.0.5) revealed that 92.9% (single-copy genes: 92.2%, duplicated genes: 0.7%) of the 9,226 single-copy orthologues in the mammalia_odb10 database were identified as complete, 1.5% were fragmented and 5.6% were missing in the assembly. Second, we mapped the sequencing data back to the assembled genome to verify its accuracy. The mapping rates were 97.42%, 98.00% and 96.03% for the Illumina, RNA-seq and PacBio data, respectively.
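The BUSCO percentages above can be checked for internal consistency with a few lines of Python. The sketch below parses a one-line summary written in the compact C/S/D/F/M notation that BUSCO uses in its short summaries. The summary string is assembled here from the values in the text rather than copied from an actual output file, and the function names and tolerance are illustrative assumptions.

```python
import re


def parse_busco_summary(line):
    """Parse a compact BUSCO summary such as 'C:..%[S:..%,D:..%],F:..%,M:..%,n:..'."""
    pattern = r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    complete, single, duplicated, fragmented, missing = (
        float(value) for value in match.groups()[:5]
    )
    return {"C": complete, "S": single, "D": duplicated,
            "F": fragmented, "M": missing, "n": int(match.group(6))}


def is_consistent(summary, tolerance=0.15):
    """Check S + D = C and C + F + M = 100, allowing for rounding to one decimal."""
    parts_ok = abs(summary["S"] + summary["D"] - summary["C"]) <= tolerance
    total_ok = abs(summary["C"] + summary["F"] + summary["M"] - 100.0) <= tolerance
    return parts_ok and total_ok


# Summary line built from the percentages reported in the text.
summary = parse_busco_summary("C:92.9%[S:92.2%,D:0.7%],F:1.5%,M:5.6%,n:9226")
approx_complete = round(summary["C"] / 100 * summary["n"])
print(is_consistent(summary), approx_complete)
```

The reported categories sum to 100%, and 92.9% of 9,226 orthologues corresponds to roughly 8,571 complete BUSCOs.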

Code availability
No custom scripts were used in this work. All software and pipelines used in data processing were run according to the manuals and protocols of the corresponding bioinformatics tools.